Assignment 2
Hungry for Apples?
We keep hearing how leaky our phones can be, but what does this actually mean? Can we quantify a phone's leakiness? Does the phone give us all the data we need to decide for ourselves? How easy is it to tell?
We want the internet to be free, while at the same time we want devices and services to get better and better. The incredible thing is that we found an ingenious way to make that happen: collecting data. Now, simply by using services, we support them by giving them our data, which is sold in aggregate form to ad personalization brokers, and the services in turn profit from running ads. This system sounds like we have managed to find a way to provide services for free without the need for external subsidisation. And it clearly works! So far I haven't paid a single dollar for the services I use over the internet, like social media, access to encyclopedias, etc.
The apps on my phone are the protagonists in today's data collection story. Image: self.
Transparency
Without additional context, the claim above sounds very exciting; however, reading it still makes one uncomfortable. The key ingredient missing from the description of this model for free services is that it has no built-in rationale for transparency to users. In other words, the mechanism of profit for a company commodifying user data does not require it to make this process transparent. The user is often faced with a "choice" about using certain services, while in reality not using them would result in digital ostracisation. Therefore companies are free to remain vague and abstract about the way they handle user data.
Currently, in certain cases it is possible to ask a service provider for such data. However, due to the lack of demand there exists no standardized way to make such requests, nor a simple, user-friendly way to understand the data once obtained.
In this assignment I analyze the data that the apps on my iPhone access and send over the course of 2 weeks. I will present how this data was obtained, its format, and how easy it was to interpret, before examining the picture someone could form from looking at this data and how it relates to the truth.
What is this data?
I have been using an iPhone for the past year now, and there hasn't been a day that I am not grateful for being able to afford and use such an incredibly user-friendly and versatile device. I use it primarily for things such as checking emails, navigation, ordering food, and communication, be that calls over messenger or whatsapp, or messaging on similar platforms. Even though I wouldn't consider myself a heavy user of my phone (I'm a heavy user of my computer; see Assignment 1), I still find the versatility, simplicity, and integration of my iPhone quite remarkable.
However, as mentioned above, versatility and good quality often come at the cost of non-transparent data collection by the provider. Apple is no exception. In order for every app you download to be functional, you have to allow it access to some part of your phone's data, such as your contacts, location, microphone, etc. Conscious of this, Apple has developed a feature that tracks these requests, both to your phone's data and to the internet, and can log them for a short period of time on your device.
This data can be recovered by entering the Privacy Report section of the phone's settings, which lets one look over it through a simple interface. This is a remarkable step towards transparency, as similar data access in other services is often convoluted and hard to decipher. This interface, seen in the image below, is quite intuitive to navigate and contains a lot of information. However, it offers no customization or analysis capabilities, so it is hard to relate an aggregate view of this data to one's routine.
Screenshot of my privacy report. Image: Self
Fortunately, there is a way to export the raw data to a format that can be analyzed. Specifically, it is exported as a .json file that organizes the data in a standard format meant to be both human and machine readable. A snippet of this data looks like this.
Sample data entry from Privacy report json file. Image: self
This file can later be converted to a .csv format with columns that correspond to the categories of each data entry, and rows that correspond to the entries. The final dataset looks like this.
Full data set from a month's activity on my phone. Source: self
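As a rough illustration of that conversion, here is a minimal sketch using only Python's standard library. It assumes the export holds one JSON object per line, and the field names (`type`, `app`, `category`, `domain`, `timeStamp`) and sample values are hypothetical stand-ins, not the actual schema of the Privacy Report file. Entries that lack a field simply get an empty cell, which is one source of the gaps mentioned later.

```python
import csv
import io
import json

# Hypothetical sample export: one JSON object per line.
# Field names and values are assumptions made for illustration only.
SAMPLE_EXPORT = """\
{"type": "access", "app": "com.example.maps", "category": "location", "timeStamp": "2022-05-01T12:30:00"}
{"type": "networkActivity", "app": "com.example.mail", "domain": "mail.example.com", "timeStamp": "2022-05-01T03:15:00"}
"""


def records_to_csv(text):
    """Turn line-delimited JSON records into CSV text, one row per record."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]

    # Collect every key that appears in any record, in first-seen order,
    # so that each key becomes one column.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval="" leaves a blank cell when a record has no value for a column.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = records_to_csv(SAMPLE_EXPORT)
print(csv_text)
```

Taking the union of all keys keeps entries of different kinds in one table, at the price of blank cells wherever a kind of entry has no value for a column.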
Analysis
Now that we have a method to store the data in a machine-readable way, it is time to perform some analysis. The first thing to note is that the data by itself is not clean. The sheets file I provided above is the exact data and format I obtained from the privacy report, in .csv. As one can see, there are gaps in the fields and not everything has a standardized value, so some cleanup is in order. To do the cleanup and the analysis I used Python because of its versatility. This process is highly nonstandard, but it allows us the highest level of control over the data. However, some training might be needed for anyone without a programming background who wants to perform this analysis. The full code, data files, and results can be found here.
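To give a flavor of what such a cleanup can involve, here is a small sketch. It is not the code linked above; the column names and sample rows are hypothetical. It trims stray whitespace, lowercases category labels, marks empty categories as "unknown", and drops rows whose timestamp cannot be parsed.

```python
from datetime import datetime

# Hypothetical raw rows, as they might come out of the .csv file.
RAW_ROWS = [
    {"app": " com.example.maps ", "category": "Location", "timeStamp": "2022-05-01T12:30:00"},
    {"app": "com.example.mail", "category": "", "timeStamp": "2022-05-01T03:15:00"},
    {"app": "com.example.chat", "category": "contacts", "timeStamp": ""},
]


def clean_rows(rows):
    """Return standardized rows, skipping those without a usable timestamp."""
    cleaned = []
    for row in rows:
        try:
            when = datetime.fromisoformat(row["timeStamp"].strip())
        except ValueError:
            # A missing or malformed timestamp makes the row useless for
            # time-of-day analysis, so it is dropped.
            continue
        cleaned.append({
            "app": row["app"].strip(),
            # An empty category is kept, but under an explicit label.
            "category": row["category"].strip().lower() or "unknown",
            "when": when,
        })
    return cleaned


for row in clean_rows(RAW_ROWS):
    print(row["app"], row["category"], row["when"].isoformat())
```

Whether to drop or keep incomplete rows is a judgment call; dropping is only reasonable here because every later step depends on the timestamp.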
Observation #1 > Frequency analysis
The first nontrivial analysis we can do on the data becomes evident when noticing that each entry in the table has a unique timestamp within that one-month period. Using that timestamp, one can peer into how often applications are accessing the phone's or the internet's resources. A density plot of how frequently tracking requests are sent over the time of day is shown below. In this context, density means that the area under a region of time is the probability that a request was sent during it. This is an interesting visualization for seeing, on average, the busiest times when my apps want to talk to my phone. The results were surprising.
Average density of phone activity over time of day. Image: self.
My phone is sending data in my sleep?! This started looking like the starting plot of a dystopian horror movie to me, so I decided to look at the data further.
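For readers who want to see the idea behind such a density in code, here is a minimal sketch. It swaps in a plain hourly histogram rather than whatever smoothing the plot above uses, and the timestamps are made up. With one-hour bins, dividing each count by the total makes the bar areas sum to 1, so each value is the estimated probability that a request falls in that hour.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps spread over several days.
TIMESTAMPS = [
    "2022-05-01T03:15:00", "2022-05-01T03:45:00", "2022-05-05T03:59:00",
    "2022-05-02T09:00:00",
    "2022-05-02T12:30:00", "2022-05-03T12:05:00", "2022-05-03T12:50:00",
    "2022-05-04T21:10:00",
]


def hourly_density(timestamps):
    """Return 24 values: the share of all requests that fall in each hour."""
    hours = [datetime.fromisoformat(stamp).hour for stamp in timestamps]
    if not hours:
        return [0.0] * 24
    counts = Counter(hours)
    return [counts.get(hour, 0) / len(hours) for hour in range(24)]


density = hourly_density(TIMESTAMPS)
for hour, value in enumerate(density):
    if value > 0:
        # A crude text bar chart: one '#' per 5% of all requests.
        print(f"{hour:02d}:00  {'#' * round(value * 20)}  {value:.3f}")
```

Only the hour of each timestamp is used, so all days are folded onto a single 24-hour axis, which is what "average over time of day" means here.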
Observation #2 > Frequency analysis of Internet Usage
Each entry in the dataset lists the way in which the app used the phone's resources, and broadly speaking there are two major types of data usage: over the internet or not. Admittedly, I could see why my phone needed to access its own sensors, such as ambient light or location, but I was curious to see whether it would also access the internet. Therefore, I repeated the plot above, this time only for the internet data. And this is what I got.
Average density of internet related phone activity over time of day. Image: self.
This was much more of a relief. Even though there is access to my sensor data overnight, there is little access to the internet from my applications. At least I'm tracked over the internet at times when I'm consciously using my phone. Funnily enough, this graph has a peak at around the time I have lunch, which I'd be really curious to investigate further if it weren't for the fact that I am even more curious about what type of offline data my apps need to access.
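One way to put a number on this comparison is to measure what share of each type of entry happens at night. The sketch below assumes a hypothetical `type` column with the values `access` and `networkActivity`, and an arbitrary definition of night as midnight to 6 am; neither is taken from the actual dataset.

```python
from datetime import datetime

# Hypothetical rows: the "type" values are assumptions for illustration.
ROWS = [
    {"type": "access", "timeStamp": "2022-05-01T02:10:00"},
    {"type": "access", "timeStamp": "2022-05-01T04:40:00"},
    {"type": "access", "timeStamp": "2022-05-01T13:00:00"},
    {"type": "networkActivity", "timeStamp": "2022-05-01T12:30:00"},
    {"type": "networkActivity", "timeStamp": "2022-05-01T12:45:00"},
    {"type": "networkActivity", "timeStamp": "2022-05-01T01:05:00"},
    {"type": "networkActivity", "timeStamp": "2022-05-01T19:20:00"},
]

# Assumed definition of "night": 00:00 up to, but not including, 06:00.
NIGHT_HOURS = range(0, 6)


def night_share(rows, entry_type):
    """Fraction of entries of the given type that occur during NIGHT_HOURS."""
    hours = [
        datetime.fromisoformat(row["timeStamp"]).hour
        for row in rows
        if row["type"] == entry_type
    ]
    if not hours:
        return 0.0
    return sum(hour in NIGHT_HOURS for hour in hours) / len(hours)


print("offline access at night:", night_share(ROWS, "access"))
print("internet activity at night:", night_share(ROWS, "networkActivity"))
```

A single number like this loses the shape of the curve, but it makes the two plots directly comparable.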
Observation #3 > Offline data access
There are a little over 10,000 data points in this dataset. Out of all of these, only about 600 were internet related. The overwhelming majority were data that my apps were accessing offline. I was curious what that would be, so I plotted the categories below. There were fewer than I expected.
Categories of data apps access on a regular basis alongside with their total count over one month. Image: self.
I find this plot quite interesting. The most requested information from my phone was not my location, microphone, etc.; rather, it was my contacts, an app that I have never opened. This makes sense in retrospect, as communication is what I primarily use my phone for, but this was not what I had in mind when I started this experiment.
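Counting entries per category is a simple tally. The sketch below uses made-up rows and hypothetical column names; the same helper can group by any other column, such as the app that made the request.

```python
from collections import Counter

# Made-up rows; column names and values are assumptions for illustration.
ROWS = [
    {"type": "access", "app": "com.example.chat", "category": "contacts"},
    {"type": "access", "app": "com.example.chat", "category": "contacts"},
    {"type": "access", "app": "com.example.mail", "category": "contacts"},
    {"type": "access", "app": "com.example.maps", "category": "location"},
    {"type": "access", "app": "com.example.maps", "category": "location"},
    {"type": "access", "app": "com.example.camera", "category": "camera"},
    {"type": "networkActivity", "app": "com.example.chat", "category": ""},
]


def count_by(rows, column, entry_type=None):
    """Tally rows by the given column, most frequent first.

    If entry_type is given, only rows of that type are counted.
    """
    selected = (
        row for row in rows
        if entry_type is None or row["type"] == entry_type
    )
    return Counter(row[column] for row in selected).most_common()


# Offline access only, grouped by category.
print(count_by(ROWS, "category", entry_type="access"))
# All entries, grouped by app.
print(count_by(ROWS, "app"))
```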
Observation #4 > Which apps ask for data?
A natural question after seeing what types of data the apps request from my phone is which apps are making those requests. Therefore, I created a list of all the apps that requested such information and counted how often they did so. The bar chart is displayed below.
Number of times each app has requested information from my phone. Image: self.
And just in case this long list/visualization is not as intuitive as one would hope, I got you covered! Based on it, I created this additional visualization with the app icons from above, but with their size corresponding to the amount of information they request. This way the makeup of my most data-consuming apps is immediately clear.
App data usage visualization. Larger size corresponds to more data used. Image: Self.
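How the icon sizes are derived from the counts is a design choice. The sketch below shows one possible scaling, not necessarily the one used for the image above: it makes the icon's area proportional to the request count, so the side length grows with the square root of the count. The app names, counts, and the 200-pixel maximum are all made up.

```python
import math

# Made-up request counts per app.
APP_COUNTS = {
    "com.example.chat": 400,
    "com.example.maps": 100,
    "com.example.mail": 25,
}


def icon_sides(counts, max_side=200.0):
    """Side length per app so that icon AREA is proportional to its count.

    The busiest app gets max_side; the others shrink with the square root
    of their share relative to the busiest app.
    """
    largest = max(counts.values())
    return {
        app: max_side * math.sqrt(count / largest)
        for app, count in counts.items()
    }


for app, side in icon_sides(APP_COUNTS).items():
    print(f"{app}: {side:.0f} px")
```

Scaling the side length directly by the count would exaggerate differences, since the eye tends to compare areas; an app with 4 times the requests would appear 16 times as large.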
Conclusion & Interpretation
The makeup of the bulk of the data my applications were consuming was quite unexpected. Apps like Google Maps, camera, photos, and messenger were surely expected to be among the top data collectors on my phone, as they constantly access location, contacts, etc. However, other innocent-looking apps like email and the app store were unexpected additions to this list. Furthermore, as seen in the first plot, there is virtually no time when my phone is not processing some form of private or sensitive data.
But I hear you say: "Panos, how is this even related to the data companies collect from you?" Even though the connection was not obvious to me initially, this analysis proved to me that sensitive data is contained in every innocuous task that we do. Even snapping a picture collects information about its location, the faces it sees, even the current weather conditions. It is easy to naively send such an image using services that do not strip this metadata from it. This exploration reveals that sensitive data collection is so prevalent in the devices we use every day that, if such information ever leaves our grasp for the public domain, it will most likely be through our own ignorance or complacency.
Ready to Grade (15/05/2022)


